Abstract

Consistent query answering over a database that violates primary key constraints is a
classical hard problem in database research that has traditionally been addressed with
logic programming. However, the applicability of existing logic-based solutions is
restricted to data sets of moderate size. This paper presents a novel decomposition and
pruning strategy that reduces, in polynomial time, the problem of computing the
consistent answer to a conjunctive query over a database subject to primary key
constraints to a collection of smaller problems of the same sort that can be solved
independently. The new strategy is naturally modeled and implemented using Answer Set
Programming (ASP). An experiment run on benchmarks from the database world proves the
effectiveness and efficiency of our ASP-based approach also on large data sets. To appear
in Theory and Practice of Logic Programming (TPLP), Proceedings of ICLP 2015.

KEYWORDS: Inconsistent Databases, Primary Key Constraints, Consistent Query
Answering, ASP


1 Introduction 

Integrity constraints provide means for ensuring that database evolution does not
result in a loss of consistency or in a discrepancy with the intended model of the
application domain (Abiteboul et al. 1995). A relational database that does not
satisfy some of these constraints is said to be inconsistent. In practice, it is not
unusual that one has to deal with inconsistent data (Bertossi et al. 2005), and when
a conjunctive query (CQ) is posed to an inconsistent database, a natural problem
arises that can be formulated as: How to deal with inconsistencies to answer the
input query in a consistent way? This is a classical problem in database research
and different approaches have been proposed in the literature. One possibility is
to clean the database (Elmagarmid et al. 2007) and work on one of the possible
coherent states; another possibility is to be tolerant of inconsistencies by leaving
the database intact and computing answers that are “consistent with the integrity
constraints” (Arenas et al. 1999; Bertossi 2011).

In this paper, we adopt the second approach, which was proposed by
Arenas et al. (1999) under the name of consistent query answering (CQA), and
focus on the relevant class of primary key constraints. Formally, in our setting: (1)
a database $D$ is inconsistent if there are at least two tuples of the same relation
that agree on their primary key; (2) a repair of $D$ is any maximal consistent subset
of $D$; and (3) a tuple $\mathbf{t}$ of constants is in the consistent answer to a CQ $q$ over $D$
if and only if, for each repair $R$ of $D$, tuple $\mathbf{t}$ is in the (classical) answer to $q$ over
$R$. Intuitively, the original database is (virtually) repaired by applying a minimal
number of corrections (deletion of tuples with the same primary key), while the
consistent answer collects the tuples that can be retrieved in every repaired instance.

CQA under primary keys is coNP-complete in data complexity (Arenas et al. 2003),
when both the relational schema and the query are considered fixed. Due to its complex
nature, traditional RDBMSs are inadequate to solve the problem alone via SQL
without focusing on restricted classes of CQs (Arenas et al. 1999; Fuxman et al. 2005;
Fuxman and Miller 2007; Wijsen 2009; Wijsen 2012). Actually, in the unrestricted
case, CQA has traditionally been addressed with logic programming (Greco et al. 2001;
Arenas et al. 2003; Barcelo and Bertossi 2003; Eiter et al. 2003; Greco et al. 2003;
Manna et al. 2013). However, it has been argued (Kolaitis et al. 2013) that the
practical applicability of logic-based approaches is restricted to data sets of moderate
size. Only recently, an approach based on Binary Integer Programming (Kolaitis et al. 2013)
has shown good performance on large databases (featuring up to one million tuples
per relation) with primary key violations.

In this paper, we demonstrate that logic programming can still be effectively used 
for computing consistent answers over large relational databases. We design a novel 
decomposition strategy that reduces (in polynomial time) the computation of the 
consistent answer to a CQ over a database subject to primary key constraints into 
a collection of smaller problems of the same sort. At the core of the strategy is a 
cascade pruning mechanism that dramatically reduces the number of key violations 
that have to be handled to answer the query. 

Moreover, we implement the new strategy using Answer Set Programming (ASP)
(Gelfond and Lifschitz 1991; Brewka et al. 2011), and we prove empirically the
effectiveness of our ASP-based approach on existing benchmarks from the database
world. In particular, we compare our approach with some classical (Barcelo and Bertossi 2003)
and optimized (Manna et al. 2013) encodings of CQA in ASP that were presented
in the literature. The experiment demonstrates empirically that our logic-based
approach implements CQA efficiently on large data sets, and can even perform better
than state-of-the-art methods.


2 Preliminaries 

We are given two disjoint countably infinite sets of terms denoted by $\mathbf{C}$ and $\mathbf{V}$ and
called constants and variables, respectively. We denote by $\mathbf{X}$ sequences (or sets,
with a slight abuse of notation) of variables $X_1, \ldots, X_n$, and by $\mathbf{t}$ sequences of terms
$t_1, \ldots, t_n$. We also denote by $[n]$ the set $\{1, \ldots, n\}$, for any $n \geq 1$. Given a sequence
$\mathbf{t} = t_1, \ldots, t_n$ of terms and a set $S = \{p_1, \ldots, p_k\} \subseteq [n]$, $\mathbf{t}|_S$ is the subsequence
$t_{p_1}, \ldots, t_{p_k}$. For example, if $\mathbf{t} = t_1, t_2, t_3$ and $S = \{1, 3\}$, then $\mathbf{t}|_S = t_1, t_3$.

A (relational) schema is a triple $\langle \mathcal{R}, \alpha, \kappa \rangle$ where $\mathcal{R}$ is a finite set of relation
symbols (or predicates), $\alpha : \mathcal{R} \to \mathbb{N}$ is a function associating an arity to each
predicate, and $\kappa : \mathcal{R} \to 2^{\mathbb{N}}$ is a function that associates, to each $r \in \mathcal{R}$, a nonempty
set of positions from $[\alpha(r)]$, which represents the primary key of $r$. Moreover, for
each relation symbol $r \in \mathcal{R}$ and for each position $i \in [\alpha(r)]$, $r[i]$ denotes the $i$-th
attribute of $r$. Throughout, let $\Sigma = \langle \mathcal{R}, \alpha, \kappa \rangle$ denote a relational schema. An atom
(over $\Sigma$) is an expression of the form $r(t_1, \ldots, t_n)$, where $r \in \mathcal{R}$ and $n = \alpha(r)$. An
atom is called a fact if all of its terms are constants of $\mathbf{C}$. Conjunctions of atoms
are often identified with the sets of their atoms. For a set $A$ of atoms, the variables
occurring in $A$ are denoted by $var(A)$. A database $D$ (over $\Sigma$) is a finite set of facts
over $\Sigma$. Given an atom $r(\mathbf{t}) \in D$, we denote by $\hat{\mathbf{t}}$ the sequence $\mathbf{t}|_{\kappa(r)}$. We say that
$D$ is inconsistent (w.r.t. $\Sigma$) if it contains two different atoms of the form $r(\mathbf{t}_1)$ and
$r(\mathbf{t}_2)$ such that $\hat{\mathbf{t}}_1 = \hat{\mathbf{t}}_2$. Otherwise, it is consistent. A repair $R$ of $D$ (w.r.t. $\Sigma$) is
any maximal consistent subset of $D$. The set of all the repairs of $D$ is denoted by
$rep(D, \Sigma)$.

A substitution is a mapping $\mu : \mathbf{C} \cup \mathbf{V} \to \mathbf{C} \cup \mathbf{V}$ which is the identity on $\mathbf{C}$. Given
a set $A$ of atoms, $\mu(A) = \{r(\mu(t_1), \ldots, \mu(t_n)) : r(t_1, \ldots, t_n) \in A\}$. The restriction
of $\mu$ to a set $S \subseteq \mathbf{C} \cup \mathbf{V}$ is denoted by $\mu|_S$. A conjunctive query (CQ) $q$ (over $\Sigma$)
is an expression of the form $\exists \mathbf{Y}\, \varphi(\mathbf{X}, \mathbf{Y})$, where $\mathbf{X} \cup \mathbf{Y}$ are variables of $\mathbf{V}$, and $\varphi$
is a conjunction of atoms (possibly with constants) over $\Sigma$. To highlight the free
variables of $q$, we often write $q(\mathbf{X})$ instead of $q$. If $\mathbf{X}$ is empty, then $q$ is called a
Boolean conjunctive query (BCQ). Assuming that $\mathbf{X}$ is the sequence $X_1, \ldots, X_n$, the
answer to $q$ over a database $D$, denoted $q(D)$, is the set of all $n$-tuples $\langle t_1, \ldots, t_n \rangle \in \mathbf{C}^n$
for which there exists a substitution $\mu$ such that $\mu(\varphi(\mathbf{X}, \mathbf{Y})) \subseteq D$ and $\mu(X_i) = t_i$,
for each $i \in [n]$. A BCQ is true in $D$, denoted $D \models q$, if $\langle\rangle \in q(D)$. The consistent
answer to a CQ $q(\mathbf{X})$ over a database $D$ (w.r.t. $\Sigma$), denoted $ans(q, D, \Sigma)$, is the
set of tuples $\bigcap_{R \in rep(D, \Sigma)} q(R)$. Clearly, $ans(q, D, \Sigma) \subseteq q(D)$ holds. A BCQ $q$ is
consistently true in a database $D$ (w.r.t. $\Sigma$), denoted $D \models_\Sigma q$, if $\langle\rangle \in ans(q, D, \Sigma)$.
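
To make these definitions concrete, the following is a minimal Python sketch that computes the consistent answer by brute force, intersecting the answers over all repairs. It is only an illustration of the semantics, not the method proposed in this paper: the toy instance, the encoding of facts as (relation, constants) pairs, the 0-based key positions, and all function names are assumptions introduced here. Repairs are built by keeping one fact per primary-key value, which yields exactly the maximal consistent subsets.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy instance: a fact is (relation, constants); KEY holds the
# 0-based primary-key positions of each relation (position 1 in the paper).
KEY = {"r1": (0,), "r2": (0,)}
D = {("r1", (1, 2)), ("r1", (1, 3)), ("r2", (4, 1)), ("r2", (5, 1)), ("r2", (5, 2))}


def repairs(db, key):
    """Maximal consistent subsets: keep exactly one fact per (relation, key value)."""
    groups = defaultdict(list)
    for rel, args in db:
        groups[(rel, tuple(args[i] for i in key[rel]))].append((rel, args))
    return [set(choice) for choice in product(*groups.values())]


def substitutions(query, db):
    """All variable bindings mu with mu(query) contained in db (strings are variables)."""
    results = [{}]
    for rel, terms in query:
        extended = []
        for mu in results:
            for frel, args in db:
                if frel != rel or len(args) != len(terms):
                    continue
                new = dict(mu)
                # A variable must keep its earlier binding; a constant must match.
                if all(new.setdefault(t, a) == a if isinstance(t, str) else t == a
                       for t, a in zip(terms, args)):
                    extended.append(new)
        results = extended
    return results


def answer(query, free, db):
    """q(D): projections of the matching substitutions onto the free variables."""
    return {tuple(mu[x] for x in free) for mu in substitutions(query, db)}


def consistent_answer(query, free, db, key):
    """ans(q, D, Sigma): tuples returned in every repair (exponential in general)."""
    result = answer(query, free, db)
    for rep in repairs(db, key):
        result &= answer(query, free, rep)
    return result


q1 = [("r1", ("X", "Y")), ("r2", ("Z", "X"))]  # q1(X) = exists Y, Z. r1(X, Y), r2(Z, X)
q2 = [("r1", ("X", "Y"))]                      # q2(Y) = exists X.    r1(X, Y)
print(consistent_answer(q1, ("X",), D, KEY))
print(consistent_answer(q2, ("Y",), D, KEY))
```

On this instance the value 1 is a consistent answer to q1, whereas q2 has the answers 2 and 3 over the whole database but no consistent answer, since each repair keeps only one of the two conflicting r1-facts. The number of repairs grows exponentially with the number of conflicting keys, which is why this brute-force enumeration does not scale.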


3 Dealing with Large Datasets 

To deal with large inconsistent data, we design a strategy that reduces in polynomial
time the problem of computing the consistent answer to a CQ over a database
subject to primary key constraints to a collection of smaller problems of the same
sort. To this end, we exploit the fact that the former problem is logspace Turing
reducible to the one of deciding whether a BCQ is consistently true (recall that the
consistent answer to a CQ is a subset of its answer). Hence, given a database $D$ over
a schema $\Sigma$, and a BCQ $q$, we would like to identify a set $F_1, \ldots, F_k$ of pairwise
disjoint subsets of $D$, called fragments, such that: $D \models_\Sigma q$ iff there is $i \in [k]$
such that $F_i \models_\Sigma q$. At the core of our strategy we have: (1) a cascade pruning
mechanism to reduce the number of “crucial” inconsistencies, and (2) a technique
to identify a suitable set of fragments from any (possibly unpruned) database. For
the sake of presentation, we start with principle (2). In the last two subsections, we
provide complementary techniques to further reduce the number of inconsistencies
to be handled for answering the original CQ. The proofs of this section are given
in Appendix A.

Fig. 1. Conflict-join hypergraph.

3.1 Fragments Identification 

Given a database $D$, a key component $K$ of $D$ is any maximal subset of $D$ such that
if $r_1(\mathbf{t}_1)$ and $r_2(\mathbf{t}_2)$ are in $K$, then both $r_1 = r_2$ and $\hat{\mathbf{t}}_1 = \hat{\mathbf{t}}_2$ hold. Namely, $K$ collects
only atoms that agree on their primary key. Hence, the set of all key components of
$D$, denoted by $comp(D, \Sigma)$, forms a partition of $D$. If a key component is a singleton,
then it is called safe; otherwise it is conflicting. Let $comp(D, \Sigma) = \{K_1, \ldots, K_n\}$. It
can be verified that $rep(D, \Sigma) = \{\{a_1, \ldots, a_n\} : a_1 \in K_1, \ldots, a_n \in K_n\}$. Let us now
fix throughout this section a BCQ $q$ over $\Sigma$. For a repair $R \in rep(D, \Sigma)$, if $q$ is true in
$R$, then there is a substitution $\mu$ such that $\mu(q) \subseteq R$. But since $R \subseteq D$, it also holds
that $\mu(q) \subseteq D$. Hence, $sub(q, D) = \{\mu|_{var(q)} : \mu \text{ is a substitution and } \mu(q) \subseteq D\}$ is
an overestimation of the substitutions that map $q$ to the repairs of $D$.

Inspired by the notions of conflict-hypergraph (Chomicki and Marcinkowski 2005)
and conflict-join graph (Kolaitis and Pema 2012), we now introduce the notion of
conflict-join hypergraph. Given a database $D$, the conflict-join hypergraph of $D$
(w.r.t. $q$ and $\Sigma$) is denoted by $H_D = \langle D, E \rangle$, where $D$ are the vertices, and $E$ are
the hyperedges partitioned in $E_q = \{\mu(q) : \mu \in sub(q, D)\}$ and
$E_\kappa = \{K : K \in comp(D, \Sigma)\}$. A bunch $B$ of vertices of $H_D$ is any minimal nonempty subset of $D$
such that, for each $e \in E$, either $e \subseteq B$ or $e \cap B = \emptyset$ holds. Intuitively, every
edge of $H_D$ collects the atoms in a key component of $D$ or the atoms in $\mu(q)$, for
some $\mu \in sub(q, D)$. Moreover, each bunch collects the vertices of some connected
component of $H_D$. Before we proceed further, let us fix these preliminary notions
with the aid of the following example.

Example 1 

Consider the schema $\Sigma = \langle \mathcal{R}, \alpha, \kappa \rangle$, where $\mathcal{R} = \{r_1, r_2\}$, $\alpha(r_1) = \alpha(r_2) = 2$, and
$\kappa(r_1) = \kappa(r_2) = \{1\}$. Consider also the database
$D = \{r_1(1,2), r_1(1,3), r_2(4,1), r_2(5,1), r_2(5,2)\}$, and the BCQ $q = r_1(X, Y), r_2(Z, X)$.
The conflicting components of $D$ are $K_1 = \{r_1(1,2), r_1(1,3)\}$ and $K_3 = \{r_2(5,1), r_2(5,2)\}$,
while its safe component is $K_2 = \{r_2(4,1)\}$. The repairs of $D$ are
$R_1 = \{r_1(1,2), r_2(4,1), r_2(5,1)\}$, $R_2 = \{r_1(1,2), r_2(4,1), r_2(5,2)\}$,
$R_3 = \{r_1(1,3), r_2(4,1), r_2(5,1)\}$, and $R_4 = \{r_1(1,3), r_2(4,1), r_2(5,2)\}$.
Moreover, $sub(q, D)$ contains the substitutions:
$\mu_1 = \{X \mapsto 1, Y \mapsto 2, Z \mapsto 4\}$, $\mu_2 = \{X \mapsto 1, Y \mapsto 3, Z \mapsto 4\}$,
$\mu_3 = \{X \mapsto 1, Y \mapsto 2, Z \mapsto 5\}$, and $\mu_4 = \{X \mapsto 1, Y \mapsto 3, Z \mapsto 5\}$.
The conflict-join hypergraph $H_D = \langle D, E \rangle$ is depicted in Figure 1. Solid (resp., dashed) edges form the
set $E_\kappa$ (resp., $E_q$). Since $\mu_1$ maps $q$ to $R_1$ and $R_2$, and $\mu_2$ maps $q$ to $R_3$ and $R_4$,
we conclude that $D \models_\Sigma q$. Finally, $D$ is the only bunch of $H_D$. □
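
The notions of Example 1 can be reproduced with a short Python sketch that builds the two families of hyperedges and then computes the bunches as connected components. The representation (facts as (relation, constants) pairs, 0-based key positions, variables as strings) and the function names are assumptions made for illustration only; the union-find is just one convenient way of computing connected components.

```python
from collections import defaultdict

# Data of Example 1; KEY uses 0-based positions (position 1 in the paper).
KEY = {"r1": (0,), "r2": (0,)}
D = {("r1", (1, 2)), ("r1", (1, 3)), ("r2", (4, 1)), ("r2", (5, 1)), ("r2", (5, 2))}
Q = [("r1", ("X", "Y")), ("r2", ("Z", "X"))]


def key_components(db, key):
    """comp(D, Sigma): facts grouped by relation and primary-key value."""
    groups = defaultdict(set)
    for rel, args in db:
        groups[(rel, tuple(args[i] for i in key[rel]))].add((rel, args))
    return [frozenset(g) for g in groups.values()]


def substitutions(query, db):
    """sub(q, D): variable bindings mu with mu(q) contained in db."""
    results = [{}]
    for rel, terms in query:
        extended = []
        for mu in results:
            for frel, args in db:
                if frel != rel or len(args) != len(terms):
                    continue
                new = dict(mu)
                if all(new.setdefault(t, a) == a if isinstance(t, str) else t == a
                       for t, a in zip(terms, args)):
                    extended.append(new)
        results = extended
    return results


def hyperedges(query, db, key):
    """E_q (the images mu(q)) and E_kappa (the key components)."""
    e_q = {frozenset((rel, tuple(mu.get(t, t) for t in terms)) for rel, terms in query)
           for mu in substitutions(query, db)}
    return e_q, set(key_components(db, key))


def bunches(db, edges):
    """Connected components of the hypergraph, via a small union-find."""
    parent = {a: a for a in db}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for edge in edges:
        first, *rest = edge
        for other in rest:
            parent[find(other)] = find(first)
    groups = defaultdict(set)
    for a in db:
        groups[find(a)].add(a)
    return list(groups.values())


e_q, e_k = hyperedges(Q, D, KEY)
all_bunches = bunches(D, e_q | e_k)
print(len(e_q), len(e_k), len(all_bunches))
```

On the data of Example 1 this yields four hyperedges in $E_q$, three in $E_\kappa$, and a single bunch that coincides with $D$, in agreement with the example.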

In Example 1, we observe that $K_3$ can be safely ignored in the evaluation of $q$. In
fact, even if both $\mu_3(q)$ and $\mu_4(q)$ contain an atom of $K_3$, $\mu_1$ and $\mu_2$ are sufficient
to prove that $q$ is consistently true. This suggests focusing only on the set
$F = K_1 \cup K_2$, and on its repairs $\{r_1(1,2), r_2(4,1)\}$ and $\{r_1(1,3), r_2(4,1)\}$. Also,
since $F \models_\Sigma q$, $F$ represents the “small” fragment of $D$ that we need to evaluate $q$.
The practical advantage of considering $F$ instead of $D$ should be already clear: (1)
the repairs of $F$ are smaller than the repairs of $D$; and (2) $F$ has fewer repairs than
$D$. We are now ready to introduce the formal notion of fragment.

Definition 1 

Consider a database $D$. For any set $C \subseteq comp(D, \Sigma)$ of key components of $D$, we
say that the set $F = \bigcup_{K \in C} K$ is a (well-defined) fragment of $D$. □

According to Definition 1, the set $F = K_1 \cup K_2$ in Example 1 is a fragment of $D$.
The following proposition states a useful property that holds for any fragment.

Proposition 1 

Consider a database $D$, and two fragments $F_1 \subseteq F_2$ of $D$. If $F_1 \models_\Sigma q$, then $F_2 \models_\Sigma q$.

By Definition 1, $D$ is indeed a fragment of itself. Hence, if $q$ is consistently true,
then there is always the fragment $F = D$ such that $F \models_\Sigma q$. But now the question
is: How can we identify a convenient set of fragments of $D$? The naive way would
be to use as fragments the bunches of $H_D$. Soundness is guaranteed by Proposition
1. Regarding completeness, we rely on the following result.

Theorem 1 

Consider a database $D$. If $D \models_\Sigma q$, then there is a bunch $B$ of $H_D$ s.t. $B \models_\Sigma q$.

By combining Proposition 1 with Theorem 1, we are able to reduce, in polynomial
time, the original problem into a collection of smaller ones of the same sort.

3.2 The Cascade Pruning Mechanism 

The technique proposed in the previous section alone is not sufficient to deal with
large data sets. In fact, since it considers all the bunches of the conflict-join
hypergraph, it unavoidably involves the entire database. To strengthen its effectiveness,
we need an algorithm that realizes, for instance, that $K_3$ is “redundant” in Example
1. But before that, let us define formally what we mean by the term redundant.

Definition 2 

A key component $K$ of a database $D$ is called redundant (w.r.t. $q$) if the following
condition is satisfied: for each fragment $F$ of $D$, $F \models_\Sigma q$ implies $F \setminus K \models_\Sigma q$. □

The above definition states that a key component is redundant independently from 
the fact that some other key component is redundant or not. Therefore: 

Proposition 2 

Consider a database $D$ and a set $C$ of redundant components of $D$. It holds that
$D \models_\Sigma q$ iff $(D \setminus \bigcup_{K \in C} K) \models_\Sigma q$.

In light of Proposition 2, if we can identify all the redundant components of $D$,
then after removing from $D$ all these components, what remains is either: (1) a
nonempty set of (minimal) bunches, each of which entails consistently $q$ whenever
$D \models_\Sigma q$; or (2) the empty set, whenever $D \not\models_\Sigma q$. More formally:

Proposition 3

Given a database $D$, each key component of $D$ is redundant iff $D \not\models_\Sigma q$.

However, assuming that PTIME $\neq$ NP, any algorithm for the identification of all
the redundant components of $D$ cannot be polynomial because, otherwise, we would
have a polynomial procedure for solving the original problem. Our goal is therefore
to identify sufficient conditions to design a pruning mechanism that detects in
polynomial time as many redundant conflicting components as possible. To give
an intuition of our pruning mechanism, we look again at Example 1. Actually, $K_3$
is redundant because it contains an atom, namely $r_2(5,2)$, that is not involved in
any substitution (see Figure 1). Assume now that this is the criterion that we use
to identify redundant components. Since, by Definition 2, we know that $D \models_\Sigma q$
iff $D \setminus K_3 \models_\Sigma q$, this means that we can now forget about $D$ and consider only
$D' = K_1 \cup K_2$. But once we focus on $sub(q, D')$, we realize that it contains only
$\mu_1$ and $\mu_2$. Then, a smaller number of substitutions in $sub(q, D')$ w.r.t. those in
$sub(q, D)$ motivates us to reapply our criterion. Indeed, there could also be some
atom in $D'$ not involved in any of the substitutions of $sub(q, D')$. This is not the
case in our example since the atoms in $D'$ are covered by $\mu_1(q)$ or $\mu_2(q)$. However, in
general, in one or more steps, we can identify more and more redundant components.
We can now state the main result of this section.

Theorem 2

Consider a database $D$, and a key component $K$ of $D$. Let $H_D = \langle D, E \rangle$ be the
conflict-join hypergraph of $D$. If $K \setminus \bigcup_{e \in E_q} e \neq \emptyset$, then $K$ is redundant.

In what follows, a redundant component that can be identified via Theorem 2 is
called strongly redundant. As discussed just before Theorem 2, an indirect effect of
removing a redundant component $K$ from $D$ is that all the substitutions in the set
$S = \{\mu \in sub(q, D) : \mu(q) \cap K \neq \emptyset\}$ can be in a sense ignored. In fact,
$sub(q, D \setminus K) = sub(q, D) \setminus S$. Whenever a substitution can be safely ignored, we say that it
is unfounded. Let us formalize this new notion in the following definition.

Definition 3 

Consider a database $D$. A substitution $\mu$ of $sub(q, D)$ is unfounded if: for each
fragment $F$ of $D$, $F \models_\Sigma q$ implies that, for each repair $R \in rep(F, \Sigma)$, there exists
a substitution $\mu' \in sub(q, R)$ different from $\mu$ such that $\mu'(q) \subseteq R$. □

We now show how to detect as many unfounded substitutions as possible. 

Theorem 3 

Consider a database $D$, and a substitution $\mu \in sub(q, D)$. If there exists a redundant
component $K$ of $D$ such that $\mu(q) \cap K \neq \emptyset$, then $\mu$ is unfounded.

Clearly, Theorem 3 alone is not helpful since it relies on the identification of
redundant components. However, if combined with Theorem 2, it forms the desired
cascade pruning mechanism. To this end, we call strongly unfounded an
unfounded substitution that can be identified by applying Theorem 3 by only
considering strongly redundant components. Hereafter, let us denote by $sus(q, D)$
the subset of $sub(q, D)$ containing only strongly unfounded substitutions. Hence,
both substitutions $\mu_3$ and $\mu_4$ in Example 1 are strongly unfounded, since $K_3$ is
strongly redundant. Moreover, we reformulate the statement of Theorem 2 by exploiting
the notion of strongly unfounded substitution, and the fact that the set
$K \setminus \bigcup_{e \in E_q} e$ is nonempty if and only if there exists an atom $a \in K$ such that the set
$\{\mu \in sub(q, D) : a \in \mu(q)\}$ - or equivalently the set $\{e \in E_q : a \in e\}$ - is empty. For
example, according to Figure 1, the set $K_3 \setminus \bigcup_{e \in E_q} e$ is nonempty since it contains
the atom $r_2(5,2)$. But this atom makes the set $\{\mu \in sub(q, D) : r_2(5,2) \in \mu(q)\}$
empty since no substitution of $sub(q, D)$ (or no hyperedge of $E_q$) involves $r_2(5,2)$.

Proposition 4

A key component $K$ of $D$ is strongly redundant if there is an atom $a \in K$ such that
one of the two following conditions is satisfied: (1) $\{\mu \in sub(q, D) : a \in \mu(q)\} = \emptyset$,
or (2) $\{\mu \in sub(q, D) : a \in \mu(q)\} = \{\mu \in sus(q, D) : a \in \mu(q)\}$.

By combining Theorem 3 and Proposition 4, we have a declarative (yet inductive)
specification of all the strongly redundant components of $D$. Importantly, the
process of identifying strongly redundant components and strongly unfounded
substitutions by exhaustively applying Theorem 3 and Proposition 4 is monotone and
reaches a fixed-point (after no more than $|comp(D, \Sigma)|$ steps) when no more key
component can be marked as strongly redundant.
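
The fixed-point computation just described can be sketched in a few lines of Python. This is only an illustrative, procedural reading of Theorem 3 and Proposition 4 under assumptions made here (facts as (relation, constants) pairs, 0-based key positions, variables as strings, hypothetical function names); it is not the ASP encoding used in the experiments.

```python
from collections import defaultdict

# Data of Example 1; KEY uses 0-based positions (position 1 in the paper).
KEY = {"r1": (0,), "r2": (0,)}
D = {("r1", (1, 2)), ("r1", (1, 3)), ("r2", (4, 1)), ("r2", (5, 1)), ("r2", (5, 2))}
Q = [("r1", ("X", "Y")), ("r2", ("Z", "X"))]


def key_components(db, key):
    """comp(D, Sigma): facts grouped by relation and primary-key value."""
    groups = defaultdict(set)
    for rel, args in db:
        groups[(rel, tuple(args[i] for i in key[rel]))].add((rel, args))
    return [frozenset(g) for g in groups.values()]


def images(query, db):
    """The sets mu(q), one for every substitution mu in sub(q, D)."""
    results = [{}]
    for rel, terms in query:
        extended = []
        for mu in results:
            for frel, args in db:
                if frel != rel or len(args) != len(terms):
                    continue
                new = dict(mu)
                if all(new.setdefault(t, a) == a if isinstance(t, str) else t == a
                       for t, a in zip(terms, args)):
                    extended.append(new)
        results = extended
    return [frozenset((rel, tuple(mu.get(t, t) for t in terms)) for rel, terms in query)
            for mu in results]


def cascade_pruning(query, db, key):
    """Return (strongly redundant components, images of the residual substitutions)."""
    comps, imgs = key_components(db, key), images(query, db)
    redundant, unfounded = set(), set()
    changed = True
    while changed:
        changed = False
        for comp in comps:
            if comp in redundant:
                continue
            # Proposition 4: some atom of comp is covered by no substitution,
            # or only by substitutions already marked strongly unfounded.
            if any(all(j in unfounded for j, img in enumerate(imgs) if a in img)
                   for a in comp):
                redundant.add(comp)
                changed = True
                # Theorem 3: substitutions touching a redundant component are unfounded.
                unfounded |= {j for j, img in enumerate(imgs) if img & comp}
    residual = [img for j, img in enumerate(imgs) if j not in unfounded]
    return redundant, residual


redundant, residual = cascade_pruning(Q, D, KEY)
print(redundant)
print(residual)
```

On Example 1 the only strongly redundant component is $K_3$, and the residual substitutions are $\mu_1$ and $\mu_2$. A cascade can be observed by adding, for instance, the hypothetical facts $r_1(8,1)$, $r_1(8,2)$, $r_2(9,8)$, $r_2(9,0)$: the component of $r_2(9,0)$ is pruned first, which makes both substitutions through $r_2(9,8)$ unfounded and, in turn, the component $\{r_1(8,1), r_1(8,2)\}$ redundant.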

3.3 Idle Attributes 

Previously, we have described a technique to reduce inconsistencies by progres¬ 
sively eliminating key components that are involved in query substitutions but are 
redundant. In the following, we show how to reduce inconsistencies by reducing the 
cardinality of conflicting components, which in some cases can be even treated as 
safe components. 

The act of removing an attribute $r[i]$ from a triple $\langle q, D, \Sigma \rangle$ consists of reducing
the arity of $r$ by one, cutting down the $i$-th term of each $r$-atom of $D$ and $q$, and
adapting the positions of the primary key of $r$ accordingly. Moreover, let
$attrs(\Sigma) = \{r[i] \mid r \in \mathcal{R} \text{ and } i \in [\alpha(r)]\}$, let $B \subseteq attrs(\Sigma)$, and let $A = attrs(\Sigma) \setminus B$. The
projection of $\langle q, D, \Sigma \rangle$ on $A$, denoted by $\Pi_A(q, D, \Sigma)$, is the triple that is obtained
from $\langle q, D, \Sigma \rangle$ by removing all the attributes of $B$. Consider a CQ $q$ and a predicate
$r \in \mathcal{R}$. The attribute $r[i]$ is relevant (w.r.t. $q$) if $q$ contains an atom of the form
$r(t_1, \ldots, t_{\alpha(r)})$ such that at least one of the following conditions is satisfied: (1)
$i \in \kappa(r)$; or (2) $t_i$ is a constant; or (3) $t_i$ is a variable that occurs more than once in
$q$; or (4) $t_i$ is a free variable of $q$. An attribute which is not relevant is idle (w.r.t.
$q$). An example is reported in Appendix B. The following theorem states that the
consistent answer to a CQ does not change after removing the idle attributes.

Theorem 4

Consider a CQ $q$, the set $R = \{r[i] \in attrs(\Sigma) \mid r[i] \text{ is relevant w.r.t. } q\}$, and a
database $D$. It holds that $ans(q, D, \Sigma) = ans(\Pi_R(q, D, \Sigma))$.
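
As an illustration of the four relevance conditions and of the projection, consider the following minimal Python sketch. The encoding (facts as (relation, constants) pairs, 0-based positions, variables as strings) and the function names are assumptions of this sketch, and it further assumes that every relation of the database occurs in the query.

```python
from collections import Counter

# Hypothetical instance (the data of Example 1), with 0-based key positions.
KEY = {"r1": (0,), "r2": (0,)}
D = {("r1", (1, 2)), ("r1", (1, 3)), ("r2", (4, 1)), ("r2", (5, 1)), ("r2", (5, 2))}
Q = [("r1", ("X", "Y")), ("r2", ("Z", "X"))]  # Boolean query: no free variables


def relevant_attributes(query, free, key):
    """Attributes (relation, position) that satisfy one of conditions (1)-(4)."""
    occurrences = Counter(t for _, terms in query for t in terms if isinstance(t, str))
    relevant = set()
    for rel, terms in query:
        for i, t in enumerate(terms):
            if (i in key[rel]                   # (1) key position
                    or not isinstance(t, str)   # (2) constant
                    or occurrences[t] > 1       # (3) variable occurring more than once
                    or t in free):              # (4) free variable
                relevant.add((rel, i))
    return relevant


def project(query, db, key, relevant):
    """Drop idle positions from query atoms and facts, and shift the key positions."""
    def keep(rel, seq):
        return tuple(x for i, x in enumerate(seq) if (rel, i) in relevant)

    new_query = [(rel, keep(rel, terms)) for rel, terms in query]
    new_db = {(rel, keep(rel, args)) for rel, args in db}
    # A kept position i moves left by the number of idle positions before it.
    new_key = {rel: tuple(sum((rel, j) in relevant for j in range(i)) for i in pos)
               for rel, pos in key.items()}
    return new_query, new_db, new_key


relevant = relevant_attributes(Q, (), KEY)
Q2, D2, KEY2 = project(Q, D, KEY, relevant)
print(sorted(relevant))
print(sorted(D2))
```

Here the second attribute of $r_1$ is idle, because $Y$ occurs only once and is not free. After the projection, the two conflicting facts $r_1(1,2)$ and $r_1(1,3)$ collapse into the single fact $r_1(1)$, so the corresponding key component becomes safe.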

3.4 Conjunctive Queries and Safe Answers

Let $\Sigma$ be a relational schema, $D$ be a database, and $q = \exists \mathbf{Y}\, \varphi(\mathbf{X}, \mathbf{Y})$ be a CQ,
where we assume that $\Sigma$ contains only relevant attributes w.r.t. $q$ (idle attributes,
if any, have been already removed). Since $ans(q, D, \Sigma) \subseteq q(D)$, for each candidate
answer $\mathbf{t}_c \in q(D)$, one should evaluate whether the BCQ $\bar{q} = \varphi(\mathbf{t}_c, \mathbf{Y})$ is (or is
not) consistently true in $D$. Before constructing the conflict-join hypergraph of $D$
(w.r.t. $\bar{q}$ and $\Sigma$), however, one could check whether there is a substitution $\mu$ that
maps $\bar{q}$ to $D$ with the following property: for each $a \in \mu(\bar{q})$, the singleton $\{a\}$
is a safe component of $D$. And, if so, it is possible to conclude immediately that
$\mathbf{t}_c \in ans(q, D, \Sigma)$. Intuitively, whenever the above property is satisfied, we say that
$\mathbf{t}_c$ is a safe answer to $q$ because, for each $R \in rep(D, \Sigma)$, it is guaranteed that
$\mu(\bar{q}) \subseteq R$. The next result follows.

Theorem 5 

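Consider a CQ $q = \exists \mathbf{Y}\, \varphi(\mathbf{X}, \mathbf{Y})$, and a tuple $\mathbf{t}_c$ of $q(D)$. If there is a substitution $\mu$
s.t. each atom of $\mu(\varphi(\mathbf{t}_c, \mathbf{Y}))$ forms a safe component of $D$, then $\mathbf{t}_c \in ans(q, D, \Sigma)$.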
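
The check behind Theorem 5 is cheap, as the following minimal Python sketch suggests: it counts the facts sharing each primary-key value and accepts a candidate answer as soon as one witnessing substitution touches only singleton components. The instance (Example 1 extended with two made-up facts), the encoding, and the function names are assumptions introduced for illustration.

```python
from collections import Counter

# Hypothetical instance: Example 1 plus two extra facts with no key conflict.
KEY = {"r1": (0,), "r2": (0,)}
D = {("r1", (1, 2)), ("r1", (1, 3)), ("r2", (4, 1)), ("r2", (5, 1)), ("r2", (5, 2)),
     ("r1", (6, 7)), ("r2", (8, 6))}
Q = [("r1", ("X", "Y")), ("r2", ("Z", "X"))]  # q(X) = exists Y, Z. r1(X, Y), r2(Z, X)
FREE = ("X",)


def substitutions(query, db):
    """Variable bindings mu with mu(query) contained in db (strings are variables)."""
    results = [{}]
    for rel, terms in query:
        extended = []
        for mu in results:
            for frel, args in db:
                if frel != rel or len(args) != len(terms):
                    continue
                new = dict(mu)
                if all(new.setdefault(t, a) == a if isinstance(t, str) else t == a
                       for t, a in zip(terms, args)):
                    extended.append(new)
        results = extended
    return results


def safe_answers(query, free, db, key):
    """Answers witnessed by a substitution whose image uses only safe components."""
    def key_of(rel, args):
        return (rel, tuple(args[i] for i in key[rel]))

    size = Counter(key_of(rel, args) for rel, args in db)  # size of each key component
    safe = set()
    for mu in substitutions(query, db):
        image = [(rel, tuple(mu.get(t, t) for t in terms)) for rel, terms in query]
        if all(size[key_of(rel, args)] == 1 for rel, args in image):
            safe.add(tuple(mu[x] for x in free))
    return safe


print(safe_answers(Q, FREE, D, KEY))
```

In this instance the value 6 is recognized as a safe answer, because $r_1(6,7)$ and $r_2(8,6)$ both form safe components. The value 1 is not safe, since every witnessing substitution uses a fact of the conflicting component of $r_1$; deciding whether it is a consistent answer requires the hypergraph-based machinery.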

4 The Encoding in ASP 

In this section, we propose an ASP-based encoding of CQA that implements the
techniques described in Section 3, and that is able to deal directly with CQs, instead
of evaluating separately the associated BCQs. Hereafter, we assume the reader is
familiar with Answer Set Programming (Gelfond and Lifschitz 1991; Brewka et al. 2011)
and with the standard syntax of ASP competitions (Calimeri et al. 2014). A nice
introduction to ASP can be found in (Baral 2003), and in the ASP Core 2.0
specification (Calimeri et al. 2013). Given a relational schema $\Sigma = \langle \mathcal{R}, \alpha, \kappa \rangle$, a database
$D$, and a CQ $q = \exists \mathbf{Y}\, \varphi(\mathbf{X}, \mathbf{Y})$, we construct a program $P(q, \Sigma)$ s.t. a tuple $\mathbf{t} \in q(D)$
belongs to $ans(q, D, \Sigma)$ iff each answer set of $D \cup P(q, \Sigma)$ contains an atom of the
form $q^*(c, \mathbf{t})$, for some constant $c$. Importantly, a large part of $P(q, \Sigma)$ does not
depend on $q$ or $\Sigma$. To lighten the presentation, we provide a simplified version of
the encoding that has been used in our experiments. In fact, for efficiency reasons,
idle attributes should be “ignored on-the-fly” without materializing the projection
of $\langle q, D, \Sigma \rangle$ on the relevant attributes; but this makes the encoding a little
heavier. Hence, we first provide a naive way to consider only the relevant attributes,
and then we will assume that $\Sigma$ contains no idle attribute. Let $R$ collect all the
attributes of $\Sigma$ that are relevant w.r.t. $q$. For each $r \in \mathcal{R}$ that occurs in $q$, let $\mathbf{W}$
be a sequence of $\alpha(r)$ different variables and $S = \{i \mid r[i] \in R\}$; the terms of the
$r$-atoms of $D$ that are associated to idle attributes can be removed via the rule
$r'(\mathbf{W}|_S) \leftarrow r(\mathbf{W})$. Hereafter, let us assume that $\Sigma$ contains no idle attribute, and
$\mathbf{Z} = \mathbf{X} \cup \mathbf{Y}$. Program $P(q, \Sigma)$ is depicted in Figure 2.

Computation of the safe answer. Via rule 1, we identify the set
$M = \{\mu|_{\mathbf{Z}} : \mu \text{ is a substitution and } \mu(\varphi(\mathbf{X}, \mathbf{Y})) \subseteq D\}$. It is now possible (rule 2) to identify the
atoms of $D$ that are involved in some substitution. Here, for each atom $r(\mathbf{t}) \in q$,
we recall that $\hat{\mathbf{t}}$ is the subsequence of $\mathbf{t}$ containing the terms in the positions of
the primary key of $r$, and we assume that $\check{\mathbf{t}}$ are the terms of $\mathbf{t}$ in the remaining
positions. In particular, we use two function symbols, $k_r$ and $nk_r$, to group the
terms in the key of $r$ and the remaining ones, respectively. It is now easy (rule 3) to
identify the conflicting components involved in some substitution. Let
$\varphi(\mathbf{X}, \mathbf{Y}) = r_1(\mathbf{t}_1), \ldots, r_n(\mathbf{t}_n)$. We now compute (rule 4) the safe answers.

% Computation of the safe answer.
 1  $sub(\mathbf{Z}) \leftarrow \varphi(\mathbf{Z}).$
 2  $involvedAtom(k_r(\hat{\mathbf{t}}), nk_r(\check{\mathbf{t}})) \leftarrow sub(\mathbf{Z}),\ r(\mathbf{t}).$    $\forall r(\mathbf{t}) \in q$
 3  $confComp(K) \leftarrow involvedAtom(K, NK_1),\ involvedAtom(K, NK_2),\ NK_1 \neq NK_2.$
 4  $safeAns(\mathbf{X}) \leftarrow sub(\mathbf{Z}),\ not\ confComp(k_{r_1}(\hat{\mathbf{t}}_1)),\ \ldots,\ not\ confComp(k_{r_n}(\hat{\mathbf{t}}_n)).$

% Hypergraph construction.
 5  $subEq(sID(\mathbf{Z}), ans(\mathbf{X})) \leftarrow sub(\mathbf{Z}),\ not\ safeAns(\mathbf{X}).$
 6  $compEk(k_r(\hat{\mathbf{t}}), Ans) \leftarrow subEq(sID(\mathbf{Z}), Ans).$    $\forall r(\mathbf{t}) \in q$
 7  $inSubEq(atom_r(\mathbf{t}), sID(\mathbf{Z})) \leftarrow subEq(sID(\mathbf{Z}), \_).$    $\forall r(\mathbf{t}) \in q$
 8  $inCompEk(atom_r(\mathbf{t}), k_r(\hat{\mathbf{t}})) \leftarrow compEk(k_r(\hat{\mathbf{t}}), \_),\ involvedAtom(k_r(\hat{\mathbf{t}}), nk_r(\check{\mathbf{t}})).$    $\forall r(\mathbf{t}) \in q$

% Pruning.
 9  $redComp(K, Ans) \leftarrow compEk(K, Ans),\ inCompEk(A, K),$
        $\#count\{S : inSubEq(A, S),\ subEq(S, Ans)\} = 0.$
10  $unfSub(S, Ans) \leftarrow subEq(S, Ans),\ inSubEq(A, S),\ inCompEk(A, K),\ redComp(K, Ans).$
11  $redComp(K, Ans) \leftarrow compEk(K, Ans),\ inCompEk(A, K),$
        $\#count\{S : inSubEq(A, S),\ subEq(S, Ans)\} = \#count\{S : inSubEq(A, S),\ unfSub(S, Ans)\}.$
12  $residualSub(S, Ans) \leftarrow subEq(S, Ans),\ not\ unfSub(S, Ans).$

% Fragments identification.
13  $shareSub(K_1, K_2, Ans) \leftarrow residualSub(S, Ans),\ inSubEq(A_1, S),\ inSubEq(A_2, S),$
        $A_1 \neq A_2,\ inCompEk(A_1, K_1),\ inCompEk(A_2, K_2),\ K_1 \neq K_2.$
14  $ancestorOf(K_1, K_2, Ans) \leftarrow shareSub(K_1, K_2, Ans),\ K_1 < K_2.$
15  $ancestorOf(K_1, K_3, Ans) \leftarrow ancestorOf(K_1, K_2, Ans),\ shareSub(K_2, K_3, Ans),\ K_1 < K_3.$
16  $child(K, Ans) \leftarrow ancestorOf(\_, K, Ans).$
17  $keyCompInFrag(K_1, fID(K_1, Ans)) \leftarrow ancestorOf(K_1, \_, Ans),\ not\ child(K_1, Ans).$
18  $keyCompInFrag(K_2, fID(K_1, Ans)) \leftarrow ancestorOf(K_1, K_2, Ans),\ not\ child(K_1, Ans).$
19  $subInFrag(S, fID(KF, Ans)) \leftarrow residualSub(S, Ans),\ inSubEq(A, S),\ inCompEk(A, K),$
        $keyCompInFrag(K, fID(KF, Ans)).$
20  $frag(fID(K, Ans), Ans) \leftarrow keyCompInFrag(\_, fID(K, Ans)).$

% Repair construction.
21  $1 \leq \{activeFrag(F) : frag(F, Ans)\} \leq 1 \leftarrow frag(\_, \_).$
22  $1 \leq \{activeAtom(A) : inCompEk(A, K)\} \leq 1 \leftarrow activeFrag(F),\ keyCompInFrag(K, F).$
23  $ignoredSub(S) \leftarrow activeFrag(F),\ subInFrag(S, F),\ inSubEq(A, S),\ not\ activeAtom(A).$

% New query.
24  $q^*(s, X_1, \ldots, X_n) \leftarrow safeAns(X_1, \ldots, X_n).$
25  $q^*(F, X_1, \ldots, X_n) \leftarrow frag(F, ans(X_1, \ldots, X_n)),\ not\ activeFrag(F).$
26  $q^*(F, X_1, \ldots, X_n) \leftarrow activeFrag(F),\ subInFrag(S, F),\ not\ ignoredSub(S),$
        $frag(F, ans(X_1, \ldots, X_n)).$

Fig. 2. The Encoding in ASP.

Hypergraph construction. For each candidate answer $\mathbf{t}_c \in q(D)$ that has not been
already recognized as safe, we construct the hypergraph $H_D(\mathbf{t}_c) = \langle D, E \rangle$ associated
to the BCQ $\varphi(\mathbf{t}_c, \mathbf{Y})$, where $E = E_q \cup E_\kappa$ as usual. Hypergraph $H_D(\mathbf{t}_c)$ is identified
by the functional term $ans(\mathbf{t}_c)$, the substitutions of $E_q$ (collected via rule 5) are
identified by the set $\{sID(\mu(\mathbf{Z})) \mid \mu \in M \text{ and } \mu(\mathbf{X}) = \mathbf{t}_c\}$ of functional terms,
while the key components of $E_\kappa$ (collected via rule 6) are identified by the set
$\{k_r(\mu(\hat{\mathbf{t}})) \mid \mu \in M \text{ and } \mu(\mathbf{X}) = \mathbf{t}_c \text{ and } r(\mathbf{t}) \in q\}$ of functional terms. To complete
the construction of the various hypergraphs, we need to specify (rules 7 and 8)
which are the atoms in each hyperedge.

Pruning. We are now ready to identify (rules 9-11) the strongly redundant components
and the strongly unfounded substitutions (as described in Section 3.2) to
implement our cascade pruning mechanism. Hence, it is not difficult to collect (rule
12) the substitutions that are not unfounded, that we call residual.

Fragments identification. Key components involving at least a residual substitution
(i.e., not redundant ones) can be aggregated in fragments (rules 13-20) by using
the notion of bunch introduced in Section 3.1. In particular, any given fragment $F$
- associated to a candidate answer $\mathbf{t}_c \in q(D)$, and collecting the key components
$K_1, \ldots, K_m$ - is identified by the functional term $fID(K_i, \mathbf{t}_c)$ where, for each
$j \in \{1, \ldots, m\} \setminus \{i\}$, the functional term associated to $K_i$ lexicographically precedes
the functional term associated to $K_j$.

M. Manna and F. Ricca and G. Terracina

Fig. 3. Comparison with alternative encodings: answered queries and execution
time. (a) Cactus plot (x-axis: Answered Queries; series: Pruning, MRT, BB).
(b) Performance avg time and solved (x-axis: Database Size).

Repair construction. Rules 1-20 can be evaluated in polynomial time and have
only one answer set, while the remaining part of the program cannot in general. In
particular, rules 21-23 generate the search space. Actually, each answer set $M$ of
$P(q, \Sigma)$ is associated (rule 21) with only one fragment, say $F$, that we call active
in $M$. Moreover, for each key component $K$ of $F$, answer set $M$ is also associated
(rule 22) with only one atom of $K$, that we also call active in $M$. Consequently,
each substitution which involves atoms of $F$ but also at least one atom which is not
active, must be ignored in $M$ (rule 23).

New query. Finally, we compute the atoms of the form $q^*(c, \mathbf{t})$ via rules 24-26.


5 Experimental Evaluation 

The experiment for assessing the effectiveness of our approach is described in the 
following. We first describe the benchmark setup and, then, we analyze the results. 

Benchmark Setup. The assessment of our approach was done using a benchmark employed
in the literature for testing CQA systems on large inconsistent databases (Kolaitis et al. 2013).
It comprises 40 instances of a database schema with 10 tables, organized in four
families of 10 instances, each of which contains tables of size varying from 100k
to 1M tuples; it also includes 21 queries of different structural features split into
three groups depending on whether CQA complexity is coNP-complete (queries
$Q_1, \ldots, Q_7$), PTIME but not FO-rewritable (Wijsen 2009) (queries $Q_8, \ldots, Q_{14}$),
and FO-rewritable (queries $Q_{15}, \ldots, Q_{21}$). (See Appendix C.)

We compare our approach, named Pruning, with two alternative ASP-based approaches.
In particular, we considered one of the first encodings of CQA in ASP,
which was introduced in (Barcelo and Bertossi 2003), and an optimized technique
that was introduced more recently in (Manna et al. 2013); these are named BB
and MRT, respectively. BB and MRT can handle a larger class of integrity constraints
than Pruning, and only MRT features specific optimizations that apply also
to primary key violations handling. We constructed the three alternative encodings
for all 21 queries of the benchmark, and we ran them on the ASP solver
WASP 2.0 (Alviano et al. 2014b), configured with the iterative coherence testing
algorithm (Alviano et al. 2014a), coupled with the grounder Gringo ver. 4.4.0
(Gebser et al. 2011). For completeness we have also run clasp ver. 3.1.1 (Gebser et al. 2013),
obtaining similar results. WASP performed better in terms of number of solved
instances on MRT and BB. The experiment was run on a Debian server equipped
with Xeon E5-4610 CPUs and 128GB of RAM. In each execution, resource usage
was limited to 600 seconds and 16GB of RAM. Execution times include the
entire computation, i.e., both grounding and solving. All the material for reproducing
the experiment (ASP programs, and solver binaries) can be downloaded from
www.mat.unical.it/ricca/downloads/mrtICLP2015.zip.

Fig. 4. Scalability and overhead of consistent query answering with Pruning encoding.


Analysis of the results. Concerning the capability of providing an answer to a query
within the time limit, Pruning answered the queries in
all the 840 runs in the benchmark, with an average time of 14.6s. MRT and BB
solved only 778 and 768 instances within 600 seconds, with averages of 80.5s
and 52.3s, respectively. The cactus plot in Figure 3(a) provides an aggregate view
of the performance of the compared methods. Recall that a cactus plot reports, for
each method, the number of answered queries (solved instances) in a given time. We
observe that the line corresponding to Pruning in Figure 3(a) is always below the
ones of MRT and BB. In more detail, Pruning execution times grow almost linearly
with the number of answered queries, whereas MRT and BB show an exponential
behavior. We also note that MRT behaves better than BB, which is due to the
optimizations in MRT that reduce the search space.
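As an illustration of how the points of a cactus plot are obtained, the following Python sketch sorts the running times of the solved runs of each method and pairs the k-th smallest time with k. The function name `cactus_points` and the timings are hypothetical and are not taken from the experiment.

```python
def cactus_points(times, timeout=600.0):
    """Return (k, t_k) pairs, where t_k is the k-th smallest time among solved runs.

    A run is considered solved if it has a time (not None) within the timeout.
    """
    solved = sorted(t for t in times if t is not None and t <= timeout)
    return [(k, t) for k, t in enumerate(solved, start=1)]


# Made-up running times in seconds (None marks a timeout); not the paper's data.
runs = {
    "method_a": [2.0, 3.5, 5.0, 8.0],
    "method_b": [4.0, 40.0, None, 310.0],
}
curves = {name: cactus_points(ts) for name, ts in runs.items()}
```

A curve that stays lower and extends further to the right corresponds to a method that solves more instances in less time.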

The performance of the approaches w.r.t. the size of the database is studied
in Figure 3(b). The x-axis reports the number of tuples per relation in tens of
thousands; the upper plot reports the number of queries answered in 600s,
and the lower plot reports the corresponding average running time. We
observe that all the approaches can answer all 84 queries (21 queries on each of
4 databases) up to the size of 300k tuples; beyond that, the number of queries
answered by both BB and MRT starts decreasing. Indeed, they answer respectively 74
and 75 queries on databases of 600k tuples, and only 67 and 71 queries on the
largest databases (1M tuples).
Instead, Pruning is able to solve all the queries in the data set. The average time
taken by Pruning grows linearly from 2.4s up to 27.4s. MRT and BB
average times show a non-linear growth and peak at 128.9s and 85.2s, respectively.
(The average is computed over the queries answered within 600s, which explains why
it apparently decreases when a method cannot answer some instance within 600s.)
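The remark on averages can be made concrete with a small Python sketch. It contrasts the average over solved runs only with a penalized average in which every timeout counts as 600s. The helper names and the timings are hypothetical.

```python
TIMEOUT = 600.0


def avg_solved(times):
    """Average over the runs that finished; None marks a timeout."""
    solved = [t for t in times if t is not None]
    return sum(solved) / len(solved) if solved else None


def avg_penalized(times, timeout=TIMEOUT):
    """Average over all runs, counting each timeout as `timeout` seconds."""
    return sum(timeout if t is None else t for t in times) / len(times)


# Made-up timings: on the larger database the hardest query times out.
small_db = [10.0, 20.0, 500.0]
large_db = [30.0, 60.0, None]
```

Here the average over solved runs drops from about 176.7s to 45.0s even though every individual query became slower, whereas the penalized average rises to 230.0s.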

The scalability of Pruning is studied in detail for each query in Figures 4(d-f),
each plotting the average execution times per group of queries of the same
theoretical complexity. It is worth noting that Pruning scales almost linearly on all
queries, independently of the complexity class of the query. This is because
Pruning is able to identify and deal efficiently with the conflicting fragments.

We now analyze the performance of Pruning from the perspective of a measure
called overhead, which was employed in (Kolaitis et al. 2013) for measuring the
performance of CQA systems. Given a query Q, the overhead is given by
$t_{cqa}/t_{plain}$, where $t_{cqa}$ is the time needed for computing the consistent
answer of Q, and $t_{plain}$ is the time needed for a plain execution of Q in which
violations of integrity constraints are ignored. Note that the overhead measure is
independent of the hardware and the software employed, since it relates the
computation of CQA to the execution of a plain query on the same system. Thus it
allows for a direct comparison of Pruning with other methods having known
overheads. Following what was done in (Kolaitis et al. 2013), we computed the
average overhead measured varying the database size for each query, and we report
the results by grouping queries per complexity class in Figures 4(a-c). The overhead
of Pruning is always below 2.1, and the majority of queries have overheads of
around 1.5. The behavior is basically ideal for queries Q5 and Q4 (overhead is
about 1). The state-of-the-art approach described in (Kolaitis et al. 2013) has
overheads that range between 2.8 and 5 on the very same dataset (more details in
Appendix C). Thus, our approach yields a very effective implementation of CQA in
ASP, with an overhead that is often more than two times smaller than that of
state-of-the-art approaches. We complemented this analysis by also measuring the
overhead of Pruning w.r.t. the computation of safe answers, which provide an
underestimate of consistent answers that can be computed efficiently (in polynomial
time) by means of stratified ASP programs. We report that computing the consistent
answer with Pruning takes, on average, at most 1.5 times the time needed to compute
the safe answer (detailed plots in Appendix C). This further shows that Pruning is
able to keep the impact of the hard-to-evaluate component of CQA reasonable.

Fig. 5. Average execution times per evaluation step.
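The overhead measure discussed above, i.e. the ratio between the time for consistent query answering and the time for plain evaluation of the same query, can be sketched in Python as follows. The function names and the timings are hypothetical and only illustrate how a per-query average over database sizes would be computed.

```python
def overhead(t_cqa, t_plain):
    """Ratio between CQA time and plain query evaluation time."""
    if t_plain <= 0:
        raise ValueError("plain execution time must be positive")
    return t_cqa / t_plain


def average_overhead(pairs):
    """Average overhead over (t_cqa, t_plain) pairs, e.g. one pair per database size."""
    ratios = [overhead(c, p) for c, p in pairs]
    return sum(ratios) / len(ratios)


# Made-up timings (seconds) for one query on three database sizes.
timings = [(3.0, 2.0), (6.0, 4.0), (9.0, 6.0)]
result = average_overhead(timings)  # 1.5
```

Since both times are measured on the same system, the ratio is comparable across machines, which is the property the comparison with previously published overheads relies on.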

Finally, we have analyzed the impact of our technique on the various solving
steps of the evaluation. The first three histograms in Figure 5 report the average
running time spent for answering queries in databases of growing size for Pruning
(Fig. 5(a)), BB (Fig. 5(b)), and MRT (Fig. 5(c)). In each bar, different colors
distinguish the average time spent for grounding and solving. In particular, the
average solving time over queries answered within the timeout is labeled
Solving-sol, and each bar extends up to the average cumulative execution time
computed over all instances, where each timed-out execution counts as 600s. Recall
that, roughly speaking, the grounder solves stratified normal programs, and the
hard part of the computation is performed by the solver on the residual
non-stratified program; thus, we additionally report in Figure 5(d) the average
number of facts (knowledge inferred by grounding) and of non-factual rules (to be
evaluated by the solver) as a percentage of the total for the three compared
approaches. The data in Figure 5 confirm that, with Pruning, most of the computation
is done during grounding, whereas this is not the case for MRT and BB. Figure 5(d)
shows that for Pruning the grounder produces few non-factual rules (below 1% on
average), whereas MRT and BB produce 5% and 63% of non-factual rules,
respectively. Roughly, this corresponds to about 23K non-factual rules (resp., 375K
non-factual rules) every 100K tuples per relation for MRT (resp., BB), whereas our
approach produces no more than 650 non-factual rules every 100K tuples per relation.


6 Conclusion 

Logic programming approaches to CQA were recently considered not competitive
(Kolaitis et al. 2013) on large databases affected by primary key violations.
In this paper, we proposed a new strategy based on a cascade pruning mechanism
that dramatically reduces the number of primary key violations to be handled to
answer the query. The strategy is encoded naturally in ASP, and an experiment on
benchmarks already employed in the literature demonstrates that our ASP-based
approach is efficient on large datasets, and performs better than state-of-the-art
methods in terms of overhead. As far as future work is concerned, we plan to extend
the Pruning method to handle inclusion dependencies and other tractable
classes of tuple-generating dependencies.


Appendix A - Proofs

Here we report the proofs of the Theorems and Propositions stated in Section 3.

A.1 - Proof of Proposition 1

Let us assume that $F_1 \models_\Sigma q$. This means that $q$ is true in every repair
of $F_1$. Since, by definition, for each repair $R_2$ of $F_2$, there exists a repair
$R_1$ of $F_1$ such that $R_1 \subseteq R_2$, we conclude that $q$ must be true also
in every repair of $F_2$.

A.2 - Proof of Theorem [7] 

We will prove the contrapositive. To this end, let $B_1, \ldots, B_k$ be the bunches
of $H_D$. Assume that, for each $i \in [k]$, $B_i \not\models_\Sigma q$. This means
that, for each $i \in [k]$, there exists a repair $R_i \in rep(B_i, \Sigma)$ such that
$R_i \not\models q$. Consider now the instance $R = \bigcup_{i \in [k]} R_i$. Since
$B_1, \ldots, B_k$ always form a partition of $D$, since for each
$\mu \in sub(q, D)$, $\mu(q)$ is entirely contained in exactly one bunch, and since
each key component of $D$ is entirely contained in exactly one bunch, we conclude
that $R$ is a repair of $D$ and $R \not\models q$. Hence $D \not\models_\Sigma q$.

A.3 - Proof of Proposition [5|

($\Rightarrow$) If $D \not\models_\Sigma q$, then by Proposition 1 we have that, for each
fragment $F$ of $D$, $F \not\models_\Sigma q$. Moreover, by rephrasing Definition 2 we
have that any key component $K$ of $D$ is redundant if the following condition is
satisfied: for each fragment $F$ of $D$, $F \not\models_\Sigma q \lor F \setminus K \models_\Sigma q$.
Hence, by combining the two, we conclude that each key component of $D$ is redundant.

($\Leftarrow$) If each key component $K$ of $D$ is redundant, by Proposition El we can
conclude that $D \not\models_\Sigma q$, since the empty database cannot entail $q$.

A.4 - Proof of Theorem [H 

Let $F$ be a fragment of $D$ such that $F \models_\Sigma q$. By considering $F$ as a
database and by Theorem [TJ we have that there exists at least a bunch $B$ of the
conflict-join hypergraph $H_F$ of $F$ such that $B \models_\Sigma q$. If
$K \cap B = \emptyset$, then $F \setminus K \supseteq B$, and therefore, by
Proposition 1, since $B$ is a fragment of $F \setminus K$, we have that
$F \setminus K \models_\Sigma q$. If $K \subseteq B$, then let us consider one of the
atoms $a \in K$ that is not involved in any substitution. But since $q$ is true in
every repair of $B$ containing $a$, this means that $q$ is true also in every repair
of $B \setminus K$. And since $B \setminus K$ is a fragment of $F \setminus K$, also
in this case we can conclude that $F \setminus K \models_\Sigma q$.

A.5 - Proof of Theorem [77] 

Let $K$ be a redundant component of $D$, and $\mu$ be a substitution of $sub(q, D)$
such that $\mu(q) \cap K \neq \emptyset$. Moreover, let $F$ be a fragment of $D$ such
that $F \models_\Sigma q$. Since $K$ is redundant, by Definition 2 we have that
$F \setminus K \models_\Sigma q$. But since $\mu(q)$ necessarily contains an atom of
$K$, this means that for each repair $R \in rep(F \setminus K, \Sigma)$, there
exists a substitution $\mu' \in sub(q, R)$ different from $\mu$ such that
$\mu'(q) \subseteq R$. But since the union of all these substitutions different from
$\mu$ can be also used to entail $q$ in every repair of $F$, by Definition 3 we can
conclude that $\mu$ is unfounded.


Appendix B - Example of relevant and idle attributes 

Consider, for example, the schema $\Sigma = \langle \mathcal{R}, \alpha, \kappa \rangle$,
where $\mathcal{R} = \{r_1, r_2\}$, $\alpha(r_1) = 3$, $\alpha(r_2) = 2$, and
$\kappa(r_1) = \kappa(r_2) = \{1\}$. Consider also the database
$D = \{r_1(1,2,3), r_1(1,2,4), r_2(2,5)\}$, and the BCQ
$q = \exists X \exists Y\, r_1(1,X,Y), r_2(X,5)$. The key components of $D$ are
$K_1 = \{r_1(1,2,3), r_1(1,2,4)\}$ and $K_2 = \{r_2(2,5)\}$, while the repairs of $D$
and $\Sigma$ are $R_1 = \{r_1(1,2,3), r_2(2,5)\}$ and $R_2 = \{r_1(1,2,4), r_2(2,5)\}$.
Moreover, the set $sub(q,D)$ contains substitutions
$\mu_1 = \{X \mapsto 2, Y \mapsto 3\}$ and $\mu_2 = \{X \mapsto 2, Y \mapsto 4\}$.
Finally, since $\mu_1$ maps $q$ to $R_1$, and $\mu_2$ maps $q$ to $R_2$, we can
conclude that $D \models_\Sigma q$. However, one can observe that $K_1$ could be
considered as a safe component with respect to $q$. In fact, variable $Y$ of $q$,
being in a position that does not belong to $\kappa(r_1)$, occurs only once in $q$.
This intuitively means that whenever there exists a substitution that maps $q$ in a
repair containing $r_1(1,2,3)$, there must exist also a substitution that maps $q$ in
a repair containing $r_1(1,2,4)$. Therefore, to avoid that $K_1$ produces two repairs,
one can consider only the first two attributes of $r_1$ and modify $q$ accordingly.
Hence, we can consider $\Sigma' = \langle \mathcal{R}', \alpha', \kappa' \rangle$, where
$\mathcal{R}' = \{r_1', r_2\}$, $\alpha'(r_1') = \alpha'(r_2) = 2$ and
$\kappa'(r_1') = \kappa'(r_2) = \{1\}$, the database $D' = \{r_1'(1,2), r_2(2,5)\}$, and
the BCQ $q' = \exists X \exists Y\, r_1'(1,X), r_2(X,5)$. Clearly, $D'$ is now
consistent and entails $q'$.
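The example can be checked by brute force. The following Python sketch enumerates the repairs of the database above by picking one fact from each key component, and verifies that the query holds in all of them. The representation of facts as (relation, tuple) pairs, the helper names, and the hard-coded query check are illustrative assumptions; this naive enumeration is exponential in general and is not the technique proposed in the paper.

```python
from itertools import product

# Database of the example: relation name -> list of tuples.
D = {"r1": [(1, 2, 3), (1, 2, 4)], "r2": [(2, 5)]}
# The primary key of both relations is the first attribute.
KEY_LEN = {"r1": 1, "r2": 1}


def key_components(db):
    """Group the facts that share relation name and key value."""
    comps = {}
    for rel, tuples in db.items():
        for t in tuples:
            comps.setdefault((rel, t[:KEY_LEN[rel]]), []).append((rel, t))
    return list(comps.values())


def repairs(db):
    """Pick exactly one fact from every key component."""
    return [set(choice) for choice in product(*key_components(db))]


def q_holds(repair):
    """BCQ q: exists X, Y such that r1(1, X, Y) and r2(X, 5); hard-coded here."""
    return any(
        rel == "r1" and t[0] == 1 and ("r2", (t[1], 5)) in repair
        for rel, t in repair
    )


all_repairs = repairs(D)
consistently_true = all(q_holds(r) for r in all_repairs)

# Projected database D' with the idle third attribute of r1 removed.
D_prime = {"r1": [(1, 2)], "r2": [(2, 5)]}
```

On `D` the sketch finds two repairs, both satisfying the query; on `D_prime` there is a single repair, which mirrors the observation that dropping the idle attribute removes the conflict without changing the answer.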


Appendix C - Details on Benchmarks 

The benchmark considered in the paper was first used in (Kolaitis et al. 2013).
It comprises several instances of varying size of a synthetic database specifically
conceived to simulate reasonably high selectivities of the joins and a large number
of potential answers. Moreover, it includes a set of queries of varying complexity
and 40 instances of a randomly generated database. In the following we report the
main characteristics of the data set and a link to an archive from which the
encodings and the binaries of the ASP system employed in the experiment can be
obtained.

C.1 Queries

The benchmark contains the following queries, organized in groups depending on the
respective complexity of CQA (existential quantifiers are omitted for simplicity):

• co-NP, not first-order rewritable

Q1() = r5(X, Y, Z), r6(X1, Y, W)
Q2(Z) = r5(X, Y, Z), r6(X1, Y, W)
Q3(Z, W) = r5(X, Y, Z), r6(X1, Y, W)
Q4() = r5(X, Y, Z), r6(X1, Y, Y), r7(Y, U, D)
Q5(Z) = r5(X, Y, Z), r6(X1, Y, Y), r7(Y, U, D)
Q6(Z, W) = r5(X, Y, Z), r6(X1, Y, W), r7(Y, U, D)
Q7(Z, W, D) = r5(X, Y, Z), r6(X1, Y, W), r7(Y, U, D)

• PTIME, not first-order rewritable

Q8() = r3(X, Y, Z), r4(Y, X, W)
Q9(Z) = r3(X, Y, Z), r4(Y, X, W)
Q10(Z, W) = r3(X, Y, Z), r4(Y, X, W)
Q11() = r3(X, Y, Z), r4(Y, X, W), r7(Y, U, D)
Q12(Z) = r3(X, Y, Z), r4(Y, X, W), r7(Y, U, D)
Q13(Z, W) = r3(X, Y, Z), r4(Y, X, W), r7(Y, U, D)
Q14(Z, W, D) = r3(X, Y, Z), r4(Y, X, W), r7(Y, U, D)

• First-order rewritable

Q15(Z) = r1(X, Y, Z), r2(Y, V, W)
Q16(Z, W) = r1(X, Y, Z), r2(Y, V, W)
Q17(Z) = r1(X, Y, Z), r2(Y, V), r7(V, U, D)
Q18(Z, W) = r1(X, Y, Z), r2(Y, V), r7(V, U, D)
Q19(Z) = r1(X, Y, Z), r8(Y, V, W)
Q20(Z) = r5(X, Y, Z), r6(X1, Y, W), r9(X, Y, D)
Q21(Z) = r3(X, Y, Z), r4(Y, X, W), r10(X, Y, D)


C.2 Datasets 

We used exactly the same datasets employed in (Kolaitis et al. 2013). They comprise
40 samples of the same database, organized in four families of 10 instances each,
with 10 tables whose size varies from 100000 to 1000000 tuples in increments of
100000. Quoting (Kolaitis et al. 2013), the generation of databases has
been done according to the following criterion: "For every two atoms Ri, Rj that
share variables in any of the queries, approximately 25% of the facts in Ri join with
some fact in Rj, and vice-versa. The third attribute in all of the ternary relations,
which is sometimes projected out and never used as a join attribute in Table 1,
takes values from a uniform distribution in the range [1, rsize/10]. Hence, in each
relation, there are approximately rsize/10 distinct values in the third attribute,
each value appearing approximately 10 times."
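To illustrate only the last part of the quoted criterion, the following Python sketch draws the third attribute of a relation with rsize tuples uniformly from [1, rsize/10]. The function name and the seed are hypothetical, the join selectivity requirement is not modeled, and this is not the generator used to build the benchmark.

```python
import random


def third_attribute_values(rsize, seed=0):
    """Draw one value per tuple uniformly from the integer range [1, rsize // 10]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    upper = rsize // 10
    return [rng.randint(1, upper) for _ in range(rsize)]


values = third_attribute_values(1000)
distinct = len(set(values))
```

With rsize = 1000 there are at most 100 distinct values, so each value appears about 10 times on average, as stated in the quotation.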

C.3 Encodings and Binaries

We refrain from reporting here all the ASP encodings employed in the experiment,
since they are very lengthy. Instead, we report as an example the ASP program
used for answering query Q7, and provide all the material in an archive that can be
downloaded from www.mat.unical.it/ricca/downloads/mrtICLP2015.zip. The
zip package also contains the binaries of the ASP system employed in the experiment.


C.4 Pruning encoding of query Q7


Let us classify the variables of Q7:

• All the variables are: {X, Y, Z, X1, W, U, D};

• The free variables are: {Z, W, D};

• The variables involved in some join are: {Y};

• The variables in primary-key positions are: {X, X1, Y};

• The variables in idle positions are: {U};

• The variables occurring in relevant positions are: {X, Y, Z, X1, W, D}.
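The classification above can be reproduced mechanically. The Python sketch below assumes that the primary key of r5, r6 and r7 is the first attribute (as the encoding of Q7 suggests) and that a variable is idle when it is not free, occurs only once in the query, and does not sit in a key position; this reading of "idle" is an assumption based on the example of Appendix B.

```python
from collections import Counter

# Q7(Z, W, D) = r5(X, Y, Z), r6(X1, Y, W), r7(Y, U, D)
ATOMS = [("r5", ["X", "Y", "Z"]), ("r6", ["X1", "Y", "W"]), ("r7", ["Y", "U", "D"])]
FREE = {"Z", "W", "D"}
# Assumed key positions (0-based): the first attribute of every relation.
KEY_POSITIONS = {"r5": {0}, "r6": {0}, "r7": {0}}

occurrences = Counter(v for _, args in ATOMS for v in args)
all_vars = set(occurrences)
# A variable is involved in a join if it occurs more than once in the query.
join_vars = {v for v, n in occurrences.items() if n > 1}
key_vars = {
    v for rel, args in ATOMS for i, v in enumerate(args) if i in KEY_POSITIONS[rel]
}
# Assumed definition: idle = not free, not joined, not in a key position.
idle_vars = all_vars - FREE - join_vars - key_vars
relevant_vars = all_vars - idle_vars
```

Under these assumptions the sketch yields exactly the sets listed above, with U as the only idle variable.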

Computation of the safe answer.

sub(X,Y,Z,X1,W,D) :- r5(X,Y,Z), r6(X1,Y,W), r7(Y,U,D).

involvedAtom(k-r5(X), nk-r5(V2,V3)) :- sub(X,Y,Z,X1,W,D), r5(X,V2,V3).
involvedAtom(k-r6(X1), nk-r6(V2,V3)) :- sub(X,Y,Z,X1,W,D), r6(X1,V2,V3).
involvedAtom(k-r7(Y), nk-r7(V3)) :- sub(X,Y,Z,X1,W,D), r7(Y,V2,V3).

confComp(K) :- involvedAtom(K,NK1), involvedAtom(K,NK2), NK1 > NK2.

safeAns(Z,W,D) :- sub(X,Y,Z,X1,W,D), not confComp(k-r5(X)),
                  not confComp(k-r6(X1)), not confComp(k-r7(Y)).
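For readers less familiar with ASP, the intent of the rules for the safe answer can be paraphrased procedurally. The Python sketch below is an illustrative rendering on a made-up toy database, not the authors' implementation: a substitution contributes a safe answer when none of the key components it touches contains two distinct atoms, where the idle attribute of r7 is projected away as in the encoding.

```python
from itertools import product

# Toy database (made-up facts); the key of every relation is its first attribute.
DB = {
    "r5": [(1, 10, "a"), (2, 20, "b"), (2, 21, "c")],  # key 2 is violated
    "r6": [(7, 10, "w"), (8, 20, "v")],
    "r7": [(10, 0, "d"), (20, 0, "e")],
}
IDLE = {"r7": {1}}  # positions projected away, like U in r7(Y,U,D)


def substitutions(db):
    """All matches of r5(X,Y,Z), r6(X1,Y,W), r7(Y,U,D), ignoring key violations."""
    for (x, y, z), (x1, y6, w), (y7, _u, d) in product(db["r5"], db["r6"], db["r7"]):
        if y == y6 == y7:
            yield (x, y, z, x1, w, d)


def conflicting(db, rel, key):
    """True if the key component (rel, key) has two atoms differing on non-idle positions."""
    idle = IDLE.get(rel, set())
    kept = {
        tuple(v for i, v in enumerate(t) if i not in idle)
        for t in db[rel]
        if t[0] == key
    }
    return len(kept) > 1


def safe_answers(db):
    """Answers (Z, W, D) supported by a substitution touching no conflicting component."""
    return {
        (z, w, d)
        for (x, y, z, x1, w, d) in substitutions(db)
        if not (
            conflicting(db, "r5", x)
            or conflicting(db, "r6", x1)
            or conflicting(db, "r7", y)
        )
    }
```

On the toy data, the substitution through r5(2,20,b) is discarded because key 2 of r5 is violated, so only the answer (a, w, d) is safe.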

Hypergraph construction.

subEq(sID(X,Y,Z,X1,W,D), ans(Z,W,D)) :- sub(X,Y,Z,X1,W,D), not safeAns(Z,W,D).

compEk(k-r5(X), Ans) :- subEq(sID(X,Y,Z,X1,W,D), Ans).
compEk(k-r6(X1), Ans) :- subEq(sID(X,Y,Z,X1,W,D), Ans).
compEk(k-r7(Y), Ans) :- subEq(sID(X,Y,Z,X1,W,D), Ans).

inSubEq(atom-r5(X,Y,Z), sID(X,Y,Z,X1,W,D)) :- subEq(sID(X,Y,Z,X1,W,D), _).
inSubEq(atom-r6(X1,Y,W), sID(X,Y,Z,X1,W,D)) :- subEq(sID(X,Y,Z,X1,W,D), _).
inSubEq(atom-r7(Y,D), sID(X,Y,Z,X1,W,D)) :- subEq(sID(X,Y,Z,X1,W,D), _).

inCompEk(atom-r5(X,V2,V3), k-r5(X)) :- compEk(k-r5(X), Ans),
                                       involvedAtom(k-r5(X), nk-r5(V2,V3)).
inCompEk(atom-r6(X1,V2,V3), k-r6(X1)) :- compEk(k-r6(X1), Ans),
                                         involvedAtom(k-r6(X1), nk-r6(V2,V3)).
inCompEk(atom-r7(Y,V3), k-r7(Y)) :- compEk(k-r7(Y), Ans),
                                    involvedAtom(k-r7(Y), nk-r7(V3)).


Pruning.

redComp(K,Ans) :- compEk(K,Ans), inCompEk(A,K),
                  #count{S: inSubEq(A,S), subEq(S,Ans)} = 0.

unfSub(S,Ans) :- subEq(S,Ans), inSubEq(A,S), inCompEk(A,K), redComp(K,Ans).

redComp(K,Ans) :- compEk(K,Ans), inCompEk(A,K),
                  X = #count{S: inSubEq(A,S), subEq(S,Ans)},
                  #count{S: inSubEq(A,S), unfSub(S,Ans)} >= X.

residualSub(S,Ans) :- subEq(S,Ans), not unfSub(S,Ans).
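The cascade behavior of the pruning rules can be paraphrased as a fixpoint computation. The Python sketch below works on a single candidate answer and on hypothetical data structures (key components and substitutions given as sets of atom names); it is an illustrative reading of the rules, not the authors' implementation. A key component becomes redundant when one of its atoms occurs only in unfounded substitutions (in particular, in none at all), and a substitution becomes unfounded when it uses an atom of a redundant component.

```python
def cascade_prune(components, subs):
    """components: name -> set of atoms; subs: name -> set of atoms it uses."""
    redundant, unfounded = set(), set()
    changed = True
    while changed:
        changed = False
        for k, atoms in components.items():
            if k in redundant:
                continue
            # all() over an empty sequence is True, covering atoms in no substitution.
            if any(
                all(s in unfounded for s, body in subs.items() if a in body)
                for a in atoms
            ):
                redundant.add(k)
                changed = True
        for s, body in subs.items():
            if s not in unfounded and any(body & components[k] for k in redundant):
                unfounded.add(s)
                changed = True
    residual = {s for s in subs if s not in unfounded}
    return redundant, unfounded, residual


# Made-up instance: atom a2 occurs in no substitution, which triggers the cascade.
components = {
    "K1": {"a1", "a2"},
    "K2": {"d1", "d2"},
    "K3": {"e1"},
    "K4": {"b1"},
    "K5": {"c1"},
}
subs = {"s1": {"a1", "d1"}, "s2": {"b1", "c1"}, "s3": {"d2", "e1"}}
redundant, unfounded, residual = cascade_prune(components, subs)
```

In this instance K1 is pruned first, which makes s1 unfounded, then K2 and s3 follow, and finally K3; only s2 survives as a residual substitution.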

Fragments identification.

shareSub(K1,K2,Ans) :- residualSub(S,Ans), inSubEq(A1,S), inSubEq(A2,S),
                       A1 <> A2, inCompEk(A1,K1), inCompEk(A2,K2), K1 <> K2.

ancestorOf(K1,K2,Ans) :- shareSub(K1,K2,Ans), K1 < K2.
ancestorOf(K1,K3,Ans) :- ancestorOf(K1,K2,Ans), shareSub(K2,K3,Ans), K1 < K3.
child(K,Ans) :- ancestorOf(_,K,Ans).

keyCompInFrag(K1, fID(K1,Ans)) :- ancestorOf(K1,_,Ans), not child(K1,Ans).
keyCompInFrag(K2, fID(K1,Ans)) :- ancestorOf(K1,K2,Ans), not child(K1,Ans).

subInFrag(S,fID(KF,Ans)) :- residualSub(S,Ans), inSubEq(A,S),
                            inCompEk(A,K), keyCompInFrag(K,fID(KF,Ans)).

frag(fID(K,Ans),Ans) :- keyCompInFrag(_,fID(K,Ans)).
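The fragment identification step groups key components that are connected through residual substitutions and names each group after its smallest component. A procedural paraphrase in Python, using a union-find structure in place of the ancestorOf closure, could look as follows; the data structures and names are hypothetical and the sketch handles a single candidate answer.

```python
def fragments(components, residual_subs):
    """Group key components linked by residual substitutions.

    components: name -> set of atoms; residual_subs: name -> set of atoms.
    Returns a dict mapping the smallest component name of each group to the group.
    """
    owner = {a: k for k, atoms in components.items() for a in atoms}
    touched = {owner[a] for body in residual_subs.values() for a in body}
    parent = {k: k for k in touched}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for body in residual_subs.values():
        ks = sorted({owner[a] for a in body})
        for other in ks[1:]:
            ra, rb = find(ks[0]), find(other)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # smallest name stays the root

    groups = {}
    for k in touched:
        groups.setdefault(find(k), set()).add(k)
    return groups


# Made-up instance with two independent fragments.
components = {"K1": {"a"}, "K2": {"b1", "b2"}, "K3": {"c"}, "K4": {"d"}, "K5": {"e1", "e2"}}
residual = {"s1": {"a", "b1"}, "s2": {"b2", "c"}, "s3": {"d", "e1"}}
result = fragments(components, residual)
```

Each resulting group can then be handed to the solver independently, which is what makes the decomposition effective.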


Repairs Construction.

1 <= {activeFrag(F):frag(F,Ans)} <= 1 :- frag(_,_).
1 <= {activeAtom(A):inCompEk(A,K)} <= 1 :- activeFrag(F), keyCompInFrag(K,F).
ignoredSub(S) :- activeFrag(F), subInFrag(S,F), inSubEq(A,S), not activeAtom(A).

New query.

q*(s,Z,W,D) :- safeAns(Z,W,D).
q*(F,Z,W,D) :- frag(F,ans(Z,W,D)), not activeFrag(F).
q*(F,Z,W,D) :- activeFrag(F), subInFrag(S,F), not ignoredSub(S), frag(F,ans(Z,W,D)).


Appendix D - Additional Plots 

We report in this appendix some additional plots. In particular, we provide (i)
detailed plots for the overhead of Pruning w.r.t. safe answer computation; (ii)
scatter plots comparing, execution by execution, Pruning with BB and MRT; and
(iii) an extract of (Kolaitis et al. 2013) concerning the overhead measured for the
MIP-based approach, to ease direct comparison with our results.

Overhead w.r.t. Safe Answers. We report in the following the detailed plots
concerning the overhead of Pruning w.r.t. the computation of safe answers. The
results are reported in three plots grouping queries per complexity class in
Figure D1.



Fig. D1. Overhead of consistent query answering w.r.t. safe answers. Panels:
(a) Pruning/Safe (co-NP); (b) Pruning/Safe (P); (c) Pruning/Safe (FO).


It can be noted that computing consistent answers with Pruning takes, on average,
at most 1.5 times the time needed to compute the safe answers, and usually
about 1.2 times.

Scatter Plots. One might wonder what the picture looks like if the ASP-based
approaches are compared instance-wise. An instance-by-instance comparison of
Pruning with BB and MRT is reported in the scatter plots in Figure D2. In these
plots a point (x, y) is reported for each query, where x is the running time of
Pruning, and y is the running time of BB and MRT, respectively in Figure D2(b) and
Figure D2(a). The plots also report a dotted line representing the diagonal (x = y):
points along this line indicate identical performance, points above the line
represent the queries where the method on the x-axis performs better than the one
on the y-axis, and vice versa. Figure D2 clearly indicates that Pruning is also
instance-wise superior to the alternative methods.
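How such a scatter plot is read can be summarized by counting the points on each side of the diagonal. The Python sketch below uses made-up running times, with 600.0 standing for a timeout of the method on the y-axis; the function name is hypothetical.

```python
def compare(points):
    """Count points above, on, and below the diagonal x = y.

    A point above the diagonal (y > x) is an instance on which the method
    on the x-axis was faster than the method on the y-axis.
    """
    above = sum(1 for x, y in points if y > x)
    below = sum(1 for x, y in points if y < x)
    on = len(points) - above - below
    return above, on, below


# Made-up (x, y) running times in seconds.
points = [(1.0, 5.0), (2.0, 2.0), (3.0, 600.0), (4.0, 3.5)]
summary = compare(points)  # (2, 1, 1)
```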




Fig. D2. Instance-wise comparison with alternative encodings. Panels:
(a) Pruning vs MRT; (b) Pruning vs BB; (c) BB vs MRT. Axes report running times in
seconds.


Overhead of MIP approach from Kolaitis et al. (2013).

Fig. D3. Overhead of EQUIP for computing consistent answers of coNP-hard
queries Q1-Q7.

Fig. D4. Overhead of EQUIP for computing consistent answers of PTIME, but
not first-order rewritable queries Q8-Q14.

Fig. D5. Overhead of EQUIP for computing consistent answers of first-order
rewritable queries Q15-Q21.














